2023-05-19

Introduction

Principal Components Analysis is a multivariate method that helps you understand the structure of data that have been collected on \(N\) individuals and \(p\) variables. Each individual can be represented by a point in \(p\)-dimensional space. Our motivating example will be test scores for \(N=88\) students across \(p=5\) topics. The first two exams were closed book, the last three open book.

##   mechanics vectors algebra analysis statistics
## 1        77      82      67       67         81
## 2        63      78      80       70         81
## 3        75      73      71       66         81

A motivating question is how should the exam scores be combined to give an overall score? One answer is to use the mean, but is there something better?

Principal Components

  • We will refer to the data matrix as \(\bf{X}\), and each row as \(\bf{x_i}\) = \((x_{i,1}, x_{i,2}, \ldots,x_{i,p})\).
  • We can see in the picture that we could rotate the coordinate axes to align differently with the data.
  • There are many possible rotations, but we want to consider one where the axes are oriented towards the directions of greatest variation in the point cloud.

Three important points

  • If we rotate the coordinate axes, as in the plot, the relationships between the data points have not changed at all.

  • The data points are still described by a matrix (\(\bf{X^\prime}\)) that has \(N\) rows and \(p\) columns.

  • The PCs are chosen to be orthogonal, just like our original coordinate system.

Other Sources

Chapter 7 of MSMB (https://www.huber.embl.de/msmb/07-chap.html) has many other examples and a slightly different approach.

They cover linear regression, other decompositions and do some really thorough explanations of the singular value decomposition that underlies this.

My main source: Multivariate Analysis, Mardia, Kent, Bibby (1979)…which is still relevant and accurate today…

Simple PCs (Appendix for more details)

  • The PCs provide us with an alternative set of coordinates.
  • The first PC corresponds to the direction of most variation in the \(\bf{x}\)’s. Note that rescaling the \(\bf{x}\)’s would clearly affect the PCs.
  • After we compute the PCs we have three quantities of interest:
    1. The variability in each of the new directions (eigenvalues).
    2. The vectors (eigenvectors) that are used to form the new coordinates (they are linear combinations of the original points).
    3. The new set of features, the rotated values.
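In R these three quantities are exactly what `prcomp` returns; a minimal sketch on simulated data (the variables here are hypothetical, not the exam scores):

```r
set.seed(1)
# N = 100 observations on p = 3 variables, two of them correlated
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.5)
x3 = rnorm(100)
X = cbind(x1, x2, x3)

pr = prcomp(X)
pr$sdev^2    # 1. the variability in each new direction (eigenvalues)
pr$rotation  # 2. the eigenvectors that form the new coordinates
head(pr$x)   # 3. the new features: the rotated values
```

Note that the total variance is preserved: `sum(pr$sdev^2)` equals the sum of the column variances of `X`.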

Motivating Example

Since we have five covariates, each student’s score can be represented in 5-D space. The principal components give us a different coordinate system than the one based on the exam scores.

## Standard deviations (1, .., p=5):
## [1] 26.210490 14.216577 10.185642  9.199481  5.670387
## 
## Rotation (n x k) = (5 x 5):
##                   PC1         PC2        PC3          PC4         PC5
## mechanics  -0.5054457 -0.74874751  0.2997888 -0.296184264 -0.07939388
## vectors    -0.3683486 -0.20740314 -0.4155900  0.782888173 -0.18887639
## algebra    -0.3456612  0.07590813 -0.1453182  0.003236339  0.92392015
## analysis   -0.4511226  0.30088849 -0.5966265 -0.518139724 -0.28552169
## statistics -0.5346501  0.54778205  0.6002758  0.175732020 -0.15123239

How to think about the PCs

The rotation matrix on the previous slide tells you how to compute the new, 5-dimensional values. The first new value is the linear combination whose coefficients \(l_i\) come from PC1, the second from PC2, and so on.

So we can compute a new \(n\) by \(k\) matrix, where for each of the \(k\) dimensions we compute the new value for each individual. This gives us a set of five columns (rows are people, columns are variables) that is equivalent to the original data. The points are, in some sense, identical; we have just changed coordinate systems.
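This construction is easy to check: center the columns of the data and multiply by the rotation matrix, and you recover the rotated values that `prcomp` reports. A sketch on a hypothetical matrix (not the exam data):

```r
set.seed(2)
X = matrix(rnorm(40), nrow = 10, ncol = 4)  # 10 individuals, 4 variables
pr = prcomp(X)

# center each column, then apply the rotation (the loadings)
Xc = scale(X, center = TRUE, scale = FALSE)
new_coords = Xc %*% pr$rotation

all.equal(new_coords, pr$x, check.attributes = FALSE)  # TRUE
```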

Back to the PCs

Let’s have a look at the coefficients - these are sometimes called rotations.

##                   PC1         PC2        PC3          PC4         PC5
## mechanics  -0.5054457 -0.74874751  0.2997888 -0.296184264 -0.07939388
## vectors    -0.3683486 -0.20740314 -0.4155900  0.782888173 -0.18887639
## algebra    -0.3456612  0.07590813 -0.1453182  0.003236339  0.92392015
## analysis   -0.4511226  0.30088849 -0.5966265 -0.518139724 -0.28552169
## statistics -0.5346501  0.54778205  0.6002758  0.175732020 -0.15123239
  • We can then see that PC1 is pretty close to an average across all the exam scores.
  • PC2 is a contrast between the closed book and open book exams.
  • What do you think PC3 is representing?

The Rotated Variables

  • We can also compute the data matrix in the new reference frame - sometimes called the rotated variables
head( v1$x, 4)
##         PC1       PC2        PC3      PC4        PC5
## 1 -66.32077 -6.447125  7.0736275 9.646383 -5.4557651
## 2 -63.61810  6.754424  0.8599283 9.149064  7.5656517
## 3 -62.92626 -3.080258 10.2297139 3.723843  0.3841125
## 4 -44.53775  5.577218 -4.3780192 4.481675 -4.4065605

The Rotated Variables

  • They are uncorrelated
round(cor(v1$x), digits=3)
##     PC1 PC2 PC3 PC4 PC5
## PC1   1   0   0   0   0
## PC2   0   1   0   0   0
## PC3   0   0   1   0   0
## PC4   0   0   0   1   0
## PC5   0   0   0   0   1

The Variances

v1$sdev
## [1] 26.210490 14.216577 10.185642  9.199481  5.670387
  • the largest SD is about 26 and the smallest is close to 6
  • if the data followed a multivariate Normal distribution (they do not), these would tell us about the size of the ellipses that describe the data

Notice that if we compute the SD for each column of the rotated values we get back the standard deviations reported by prcomp (these are the singular values scaled by \(1/\sqrt{N-1}\)).

apply(v1$x, 2, sd)
##       PC1       PC2       PC3       PC4       PC5 
## 26.210490 14.216577 10.185642  9.199481  5.670387

The Null Space

  • Sometimes we have \(p\) columns, but the point cloud really only occupies some lower dimensional space.
  • In the rotated coordinate system we only need two dimensions, so one of the rotated variables will be zero.

An example of 3D data describing a 2D manifold

x = rnorm(20, sd = 10)
y = rnorm(20, sd = 5)
z = x + y
v2 = prcomp(cbind(x, y, z))
head(v2$x, n=3)
##              PC1       PC2           PC3
## [1,] -15.5899623  3.903817  1.776357e-15
## [2,]   1.2342134  7.462897  1.554312e-15
## [3,]   0.5423539 -4.324681 -1.124101e-15
round(v2$sdev, digits=3)
## [1] 14.105  5.083  0.000

The Null Space

  • It would be quite unusual in a real-world example, where the variables are measured with error, for us to find a case where one or more of the eigenvalues is exactly zero.
  • So we use a cut-off of some form that says that if the variability in one direction is smaller than \(\eta\) we don’t think that dimension will be very helpful.
  • In our single cell analysis pipeline we will typically choose some fairly large number of PCs (e.g., 50) to use for clustering the data; this is a form of dimension reduction.
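One common form of that cut-off keeps the smallest number of PCs that explains a target fraction of the total variance; a simulated sketch (the 0.95 threshold is an arbitrary illustration, not a recommendation):

```r
set.seed(3)
# three informative directions plus seven near-constant noise columns
signal = sapply(c(5, 3, 2), function(s) rnorm(200, sd = s))
noise  = matrix(rnorm(200 * 7, sd = 0.1), ncol = 7)
X = cbind(signal, noise)

pr = prcomp(X)
prop_var = pr$sdev^2 / sum(pr$sdev^2)
k = which(cumsum(prop_var) >= 0.95)[1]  # smallest k reaching 95% of variance
k                                       # with this simulated data, k is 3
```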

The Effect of Outliers

  • in our example we have test scores as percentages, so this limits the effect of outliers
  • just to emphasize the point, I will add one student with some very large values in the last three columns
ns = c(1, 2, 850, 950, 999)
ndata = rbind(examScor, ns)
tv = prcomp(ndata)
tv$sdev
## [1] 164.724987  19.864187  10.193072   9.771466   6.620818
v1$sdev
## [1] 26.210490 14.216577 10.185642  9.199481  5.670387

The Effect of Outliers

tv$rotation
##                    PC1          PC2        PC3         PC4         PC5
## mechanics  -0.01768134 -0.846855292  0.1026264 -0.52058224  0.03139179
## vectors    -0.02536899 -0.530897676 -0.1781267  0.82756977 -0.03005392
## algebra     0.51715704 -0.022016155 -0.1892014 -0.06919271 -0.83155222
## analysis    0.58650891 -0.006343286 -0.6179097 -0.10042743  0.51387643
## statistics  0.62257504 -0.021420175  0.7349348  0.17102371  0.20630860
v1$rotation
##                   PC1         PC2        PC3          PC4         PC5
## mechanics  -0.5054457 -0.74874751  0.2997888 -0.296184264 -0.07939388
## vectors    -0.3683486 -0.20740314 -0.4155900  0.782888173 -0.18887639
## algebra    -0.3456612  0.07590813 -0.1453182  0.003236339  0.92392015
## analysis   -0.4511226  0.30088849 -0.5966265 -0.518139724 -0.28552169
## statistics -0.5346501  0.54778205  0.6002758  0.175732020 -0.15123239

The use of PCs in regression

  • in regression we have a response, \(\bf{y}\), that we want to model in terms of a matrix, \(\bf{X}\), of features.
  • suppose for example that 10 students had taken the first 4 exams, but had COVID and could not take the 5th. The instructor wants to estimate their score on the 5th exam.
y = examScor[,5]
covs = as.matrix(examScor[,1:4])
lm1 = lm(y~covs)

Regression cont’d

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -11.37822    6.98174  -1.630 0.106952    
covsmechanics   0.02217    0.09895   0.224 0.823265    
covsvectors     0.02574    0.13953   0.184 0.854092    
covsalgebra     0.72944    0.20961   3.480 0.000802 ***
covsanalysis    0.31293    0.13146   2.380 0.019581 *  
Residual standard error: 12.75 on 83 degrees of freedom
Multiple R-squared:  0.4793,    Adjusted R-squared:  0.4542 
F-statistic:  19.1 on 4 and 83 DF,  p-value: 3.612e-11

Regression on the PCs

  • now we are going to see what happens if instead of using the data in its original coordinate system we use the PCs
pccovs = prcomp(covs)
lm2 = lm(y~pccovs$x)

Regression on the PCs

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.30682    1.35892  31.133  < 2e-16 ***
pccovs$xPC1 -0.45441    0.05918  -7.678 2.84e-11 ***
pccovs$xPC2  0.37633    0.10885   3.457 0.000863 ***
pccovs$xPC3  0.08628    0.14739   0.585 0.559898    
pccovs$xPC4  0.52498    0.23109   2.272 0.025687 *  
Residual standard error: 12.75 on 83 degrees of freedom
Multiple R-squared:  0.4793,    Adjusted R-squared:  0.4542 
F-statistic:  19.1 on 4 and 83 DF,  p-value: 3.612e-11

Regression on the PCs

  • the fit of the model (and indeed the residuals, for example) is identical between the two
  • the data have not changed, we just changed how we referred to the data in our coordinate system
  • in the summary for lm2 we can see that the coefficients for PC3 and PC4 are not significant
  • you can check (use the rmarkdown doc for this lesson) that indeed removing them has little impact on the fit
  • removing them could provide some benefits in terms of simplicity etc and this is a form of smoothing

Reduced Model

lm3 = lm(y~pccovs$x[,1:2])
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        42.30682    1.38665  30.510  < 2e-16 ***
pccovs$x[, 1:2]PC1 -0.45441    0.06039  -7.524 5.08e-11 ***
pccovs$x[, 1:2]PC2  0.37633    0.11107   3.388  0.00107 ** 
---
Residual standard error: 13.01 on 85 degrees of freedom
Multiple R-squared:  0.4448,    Adjusted R-squared:  0.4317 
F-statistic: 34.05 on 2 and 85 DF,  p-value: 1.378e-11
  • note that the coefficient estimates are identical to those from the first model with all PCs included (the standard errors and p-values change slightly)

Caveats

  • the principal components are not scale invariant
  • changing the scale (pounds to kilograms, scaling by the variance, etc.) changes the set of principal components we find and hence makes interpretation harder
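A concrete illustration of this caveat, using two simulated variables (hypothetical names): converting one column from kilograms to pounds changes the directions found, not just their lengths.

```r
set.seed(4)
weight_kg = rnorm(50, mean = 70, sd = 10)
height_cm = 100 + 0.9 * weight_kg + rnorm(50, sd = 5)

r_kg = prcomp(cbind(weight_kg, height_cm))$rotation
r_lb = prcomp(cbind(weight_kg * 2.20462, height_cm))$rotation

r_kg
r_lb  # different directions: the PCs are not scale invariant
```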

Do we need all PCs?

  • it is useful at times to determine whether most (all?) of the important variation is contained in the first \(k\) PCs
  • under the assumption that we are sampling from a multivariate Normal distribution (a strong assumption, though typically it does not need to hold all that well) we can say some things
  • with real data the hypothesis that any singular value (or eigenvalue) is exactly zero is not meaningful; we always have some amount of random variation
  • but a test of \(\lambda_p = \lambda_{p-1}\) is meaningful
  • as is the more general \(\lambda_p = \lambda_{p-1} = \ldots = \lambda_{k+1}\), which is equivalent to saying that the first \(k\) components hold all the relevant information and after that it is just stochastic noise
  • Section 8.4.3 of Mardia Kent and Bibby…but this does not seem to be employed in the single cell field…
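A sketch of that test, following Section 8.4.3 of Mardia, Kent and Bibby: under multivariate normality the likelihood-ratio statistic for \(H_0: \lambda_{k+1} = \ldots = \lambda_p\) compares the arithmetic and geometric means of the smallest \(p-k\) sample eigenvalues, and is asymptotically \(\chi^2\) with \(\frac{1}{2}(p-k+2)(p-k-1)\) degrees of freedom. Here the `prcomp` eigenvalues stand in for the ML estimates.

```r
# H0: the smallest (p - k) eigenvalues are equal (isotropy beyond the first k PCs)
eigen_equality_test = function(X, k) {
  n = nrow(X); p = ncol(X)
  lambda = prcomp(X)$sdev^2               # sample eigenvalues
  tail_l = lambda[(k + 1):p]
  a = mean(tail_l)                        # arithmetic mean
  g = exp(mean(log(tail_l)))              # geometric mean (a >= g always)
  stat = n * (p - k) * log(a / g)
  df = (p - k + 2) * (p - k - 1) / 2
  c(statistic = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(5)
X = matrix(rnorm(200 * 5), ncol = 5)      # H0 true: all five eigenvalues equal
eigen_equality_test(X, k = 0)
```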

Now for more complicated things….

  • In a single cell experiment we typically have hundreds to thousands of cells.
  • We have tens of thousands of genes….
  • So each cell is a point in some 10-40K-dimensional space…but we think that there are useful summaries
  • We want to use PCA as a form of dimension reduction, keeping those directions where there is a lot of variation in the data.
  • outliers can greatly skew the results and we need to be careful to ensure that our analysis is not too reliant on a few observations.

The basics of the process

  • get your single cell data and do QA/QC to remove genes (rows) and cells (columns) that seem to have technical or biological reasons to be suspect
  • log transform and size normalize the data
  • do some sort of filtering for the top K most variable genes (K and how you measure variable are parameters)
  • do PCA
  • use the first M PCs to do clustering
  • go back to the full count matrix and use UMAP and tSNE to visualize
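The steps above can be sketched with the Bioconductor functions the OSCA book uses. This is an outline, not runnable as-is: it assumes an already QC'ed SingleCellExperiment object named `sce`, and the choices of K = 2000 genes and 50 PCs are illustrative.

```r
library(scater)  # logNormCounts, runPCA, runUMAP
library(scran)   # modelGeneVar, getTopHVGs, buildSNNGraph

sce = logNormCounts(sce)                      # log transform + size-factor normalization
dec = modelGeneVar(sce)                       # per-gene mean-variance decomposition
hvg = getTopHVGs(dec, n = 2000)               # top K most variable genes
sce = runPCA(sce, subset_row = hvg, ncomponents = 50)
g   = buildSNNGraph(sce, use.dimred = "PCA")  # graph built on the first M PCs
clusters = igraph::cluster_walktrap(g)$membership
sce = runUMAP(sce, dimred = "PCA")            # visualization in 2D
```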

Set up the data

  • We will examine the single cell data from workflow #3 in the OSCA book
  • http://bioconductor.org/books/3.17/OSCA.workflows/
  • This is a peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). The data are publicly available from the 10X Genomics website, from which we download the raw gene/barcode count matrices, i.e., before cell calling from the CellRanger pipeline.
  • the data were processed as described in that workflow - and then we will examine a few opportunities

Variance Explained

pct_var = attr(reducedDims(sc_exp_norm_trans)$PCA, "percentVar") |> 
   round(digits = 1)
pct_var
##  [1] 16.4  6.1  4.6  2.0  1.2  0.9  0.8  0.7  0.6  0.5  0.4  0.3  0.3  0.3  0.3
## [16]  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2
## [31]  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2
## [46]  0.2  0.2  0.2  0.2  0.2
  • notice that after about 12 components we are at 0.3, and then it quickly settles to 0.2 over and over
  • think back to the earlier comment that it could be useful to test if those later values are all equal…

LM on Clusters

  Dependent variable

Predictors          Estimates   std. Error   p
clusters [1]            -6.21         0.07   <0.001
clusters [2]            16.71         0.07   <0.001
clusters [3]            -4.66         0.07   <0.001
clusters [4]            11.38         0.11   <0.001
clusters [5]            -3.27         0.13   <0.001
clusters [6]            14.48         0.09   <0.001
clusters [7]            -8.61         0.05   <0.001
clusters [8]            -0.41         0.13    0.001
clusters [9]             4.98         0.15   <0.001
clusters [10]           -3.09         0.08   <0.001

Observations: 3985
R2 / R2 adjusted: 0.971 / 0.971

Boxplot on Clusters

Is there info in the PCs we have not used?

  • we regress each PC in turn against the cluster labels and get the multiple R2
  • for each regression, the multiple R2 tells us about how much of the variation in that PC is explained by the clusters
  • if that value is especially low, as it is for PC6 below - then that dimension is not really reflecting the clustering
  • PC4 is also potentially interesting so we will have a look at it as well
## [1] 0.9712 0.8303 0.9018 0.6517 0.6260 0.0290 0.5458 0.2477 0.1105
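That per-PC calculation can be sketched on self-contained simulated data (`pcs` and `clusters` here are stand-ins for the quantities in the workflow):

```r
set.seed(6)
n = 300
clusters = factor(sample(1:3, n, replace = TRUE))
# PC1 separates the clusters; PC2 is unrelated noise
pcs = cbind(PC1 = 5 * as.numeric(clusters) + rnorm(n),
            PC2 = rnorm(n))

# multiple R^2 of each PC regressed on the cluster labels
r2 = apply(pcs, 2, function(pc) summary(lm(pc ~ clusters))$r.squared)
round(r2, 4)  # high for PC1, near zero for PC2
```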

What does this look like?

  • we plot PC4 vs PC1
  • what is that set of unusual looking points?

What is going on?

clusters[abs(pcs[,4])>15]
##  [1] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [39] 9 9 9 9 9 9
## Levels: 1 2 3 4 5 6 7 8 9 10
table(clusters)
## clusters
##    1    2    3    4    5    6    7    8    9   10 
##  566  526  514  204  170  330 1018  171  114  372
  • so we see that PC4 is capturing a large amount of variation associated with cluster 9
  • we should probably try to track down what is happening there

The rotation matrix…

  • above we concentrated on the transformed features
  • but the rotations are also interesting
  • they tell us about the “loadings” on each of the genes for each PC
  • in the exam example we noted how the different rotations were things like the mean of the exam scores, or contrasts between exams
rotation = attr(reducedDims(sc_exp_norm_trans)$PCA, "rotation")
gn = row.names(rotation)
gn[1:10]
##  [1] "FO538757.2"    "AP006222.2"    "RP11-206L10.9" "FAM41C"       
##  [5] "NOC2L"         "HES4"          "ISG15"         "C1orf159"     
##  [9] "TNFRSF18"      "TNFRSF4"

PC4

  • let’s pull out the gene expression values for cluster 9 and look at highly variable genes
exprs9 = assays(sc_exp_norm_trans)$logcounts[, clusters==9]
sdbyg9 = rowSds(exprs9)
names(sdbyg9) = row.names(exprs9)
topsdbyg9 = sort(sdbyg9,dec=TRUE)[1:50]
topsdbyg9[1:10]
##      PPBP       PF4 HIST1H2AC     GNG11    MALAT1      SDPR      CCL5     TUBB1 
##  3.259413  3.190869  2.857211  2.767731  2.760935  2.760600  2.745204  2.566723 
##    TMSB4X      NRGN 
##  2.523412  2.446354
  • some of these are platelet factors, the first is a sign of platelet activation

Platelets

Plot all four genes by groups

Parallel Coordinate Plots

  • one issue with this plot is that any points that have the same values across all the variables appear as one line
  • 50 of the 114 are zero for all four genes
table(rowSums(exprs9S)==0)

PC6 vs PC1

  • no correlation with zero fraction
  • no correlation with sizeFactor
  • looked at the loadings in the rotation matrix; four genes really stand out: JUN, DUSP1, FOS and JUNB

PC6 correlates with JUN/JUNB/FOS/DUSP1 expression

Zero fractions

  • Irizarry found a strong correlation between PC1 and the fraction of genes not detected
counts = assays(sc_exp_norm_trans)$counts
zeroFrac = apply(counts, 2, function(x) sum(x == 0)) / nrow(counts)
lmzf = lm(pc1 ~ clusters - 1 + zeroFrac)

Doublets?

lmDoublets = lm(pc1~clusters-1+sc_exp_norm_trans$subsets_Mito_percent)

Results

  Dependent variable

Predictors          Estimates   Statistic   p
clusters [1]           -13.13      -20.48   <0.001
clusters [2]             9.60       14.57   <0.001
clusters [3]           -11.69      -17.95   <0.001
clusters [4]             5.17        8.86   <0.001
clusters [5]           -10.18      -15.71   <0.001
clusters [6]             7.75       12.37   <0.001
clusters [7]           -15.73      -23.92   <0.001
clusters [8]            -7.45      -11.29   <0.001
clusters [9]            -2.19       -3.23    0.001
clusters [10]          -10.27      -15.40   <0.001
zeroFrac                 8.01       10.85   <0.001

Observations: 3985
R2 / R2 adjusted: 0.972 / 0.972

Gene Ontology

Next we are going to extract mappings to GO terms from the org.Hs.eg.db. These can then be used to carry out a regression analysis to see if the loadings on the PCs show any particular patterns with respect to particular GO terms. FIXME: need to be careful here and make sure we are getting not just the child, but all the roll-ups. And also make sure we have identified those genes that have no GO assignments - possibly lncRNA, mt(?), and so on.

library(org.Hs.eg.db)
library("Matrix")
##get all the GO mappings for the gene names used in our PCA - gn
ss1 = select(org.Hs.eg.db, keys=gn, columns=c("SYMBOL","GOALL"), keytype="SYMBOL")
ss2 = split(ss1$GOALL, ss1$SYMBOL)

##map from GO to symbol - so we can regress
## we will require 10 genes for any GO term to be included
## probably we should eliminate nodes with too many genes as well...
##
##ss3 is a list, each element of the list is the names/symbols
## of the gene that are attached to the GO category
ss3 = split(ss1$SYMBOL, ss1$GOALL)
ss3len = sapply(ss3, length)


ss4 = ss3[ss3len>9 & ss3len<50]
##FIXME - how to create a sparse matrix from ss4
## we are going to have columns as GO terms, rows as Genes
##FIXME - why are there multiple matches per GO term - below
numcols = length(ss4)
colNames = names(ss4)
rowLabs = unique(unlist(ss4))
p= vector("list", length=numcols)
names(p) = names(ss4)
for(i in 1:numcols)
  p[[i]] = match(unique(ss4[[i]]), rowLabs)

GenesByGO = sparseMatrix(j=rep(1:numcols, sapply(p, length)), i=unlist(p),x=1,
                         dimnames=list(Genes=rowLabs,GO=colNames))
GBG = as.matrix(GenesByGO)
save(GBG, file="GBG.rda")
save(ss4, file="ss4.rda")

Regress the rotation matrix on the GO categories

rot1 = rotation[,1]
names(rot1) = row.names(rotation)
rot1sub = rot1[row.names(GBG)]
lmGGO = lm(rot1sub ~ GBG[,1:1000]-1)

Adjust for lots of tests

  • when you do many tests it is important to adjust the p-values to account for the number of tests performed
ss = summary(lmGGO)
gbg_coefs=coef(ss)
row.names(gbg_coefs) = gsub("GBG[, 1:1000]", "", 
                            row.names(gbg_coefs), fixed=TRUE)
adjpvs = p.adjust(gbg_coefs[,4], method = "fdr")

top10 = sort(adjpvs, dec=FALSE)[1:10]
top10
##   GO:0000028   GO:0002523   GO:0002468   GO:0006956   GO:0005504   GO:0005767 
## 4.056568e-18 4.056568e-18 9.464568e-15 5.606591e-10 3.363120e-09 6.669258e-09 
##   GO:0002283   GO:0002544   GO:0002548   GO:0000027 
## 4.869244e-06 6.673035e-06 6.673035e-06 1.459962e-05

Top Groups

  • The top hit is GO:0000028, BP, ribosomal small subunit assembly
  • second hit GO:0002523, BP, leukocyte migration involved in inflammatory response
names(top10[1:2])
## [1] "GO:0000028" "GO:0002523"
gset1 = ss4[names(top10)[1]]
gset1
## $`GO:0000028`
##  [1] "RPS27"  "XRCC5"  "RPSA"   "RPS14"  "RPS14"  "RPS14"  "ABT1"   "PRKDC" 
##  [9] "RPS27L" "MRPS11" "ERAL1"  "ERAL1"  "RPL38"  "MRPS7"  "RPS15"  "RPS28" 
## [17] "RPS19"  "RPS19"  "RPS5"   "RRP7A"
gset2 = ss4[names(top10)[2]]
gset2
## $`GO:0002523`
##  [1] "S100A9" "S100A9" "S100A8" "S100A8" "SLAMF1" "TNF"    "CCR6"   "NINJ1" 
##  [9] "FUT7"   "JAM3"   "ALOX5"  "ALOX5"  "ADAM8"
gbg_coefs[names(top10)[1]]
## [1] NA
nn=names(top10)
gensetsize = sapply(ss4[nn], function(x) length(x))
gensetsize
## GO:0000028 GO:0002523 GO:0002468 GO:0006956 GO:0005504 GO:0005767 GO:0002283 
##         20         13         11         43         18         16         18 
## GO:0002544 GO:0002548 GO:0000027 
##         11         31         25

How unusual are these groups

Look at pc6

  • note this is not what I want to do, but on my laptop about 1.5K is what we can do
rot6 = rotation[,6]
names(rot6) = row.names(rotation)
rot6sub = rot6[row.names(GBG)]
lmGGO6 = lm(rot6sub ~ GBG[,1500:3000])

Find the GO categories with the largest coefficients in the regression

ss6 = summary(lmGGO6)
gbg6_coefs=coef(ss6)
row.names(gbg6_coefs) = gsub("GBG[, 1500:3000]", "", row.names(gbg6_coefs), fixed=TRUE)
adjpvs6 = p.adjust(gbg6_coefs[,4], method = "fdr")

top10 = sort(adjpvs6, dec=FALSE)[1:10]
top10
##   GO:0035994   GO:0036492   GO:0032873   GO:0042788   GO:0046329   GO:0033549 
## 3.393192e-43 1.425705e-20 1.078414e-18 1.700851e-18 1.123427e-17 1.893543e-13 
##   GO:0036491   GO:0033687   GO:0032570   GO:0045622 
## 1.893543e-13 3.871595e-13 8.917403e-12 6.822242e-11

GO:0035994 - response to muscle stretch

Appendix Slides

  • A linear combination of \(\bf{x}\) is \(\sum_i l_i x_i\) for some constants \(l_i\)
  • The linear combination is a standardized linear combination (SLC) if \(\sum {l_i}^2=1\)
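In R, any weight vector can be turned into an SLC by dividing by its length; for example, equal weights give an SLC proportional to the mean (the scores here are the first student's from the example data):

```r
x = c(77, 82, 67, 67, 81)  # one student's five exam scores
l = rep(1, 5)              # equal weights: proportional to the mean
l = l / sqrt(sum(l^2))     # rescale so that sum(l^2) == 1

sum(l^2)    # 1: l defines a standardized linear combination
sum(x * l)  # the SLC applied to x (equals sqrt(5) times the mean)
```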

Singular Value Decomposition

Borrowed from Elements of Statistical Learning… The singular value decomposition (SVD) of the centered \(N \times p\) input matrix \(\bf{X}\):
\[ \bf{X} = \bf{UDV^T}\] Here \(\bf{U}\) and \(\bf{V}\) are \(N × p\) and \(p × p\) orthogonal matrices, with the columns of \(\bf{U}\) spanning the column space of \(\bf{X}\), and the columns of \(\bf{V}\) spanning the row space. \(\bf{D}\) is a \(p × p\) diagonal matrix, with diagonal entries \(d_1 ≥ d_2 ≥ \ldots ≥ d_p ≥ 0\) called the singular values of \(\bf{X}\). If one or more values \(d_j = 0\), \(\bf{X}\) is singular.
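The decomposition can be checked numerically with R's `svd`:

```r
set.seed(7)
# a centered N x p matrix (N = 12, p = 5)
X = scale(matrix(rnorm(60), nrow = 12, ncol = 5),
          center = TRUE, scale = FALSE)

s = svd(X)  # s$u is N x p, s$d the singular values, s$v is p x p
recon = s$u %*% diag(s$d) %*% t(s$v)

all.equal(recon, unclass(X), check.attributes = FALSE)  # TRUE: X = U D V^T
all(diff(s$d) <= 0)                                     # d1 >= d2 >= ... >= dp
```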

Principal Components

The SVD of the centered matrix \(\bf{X}\) is another way of expressing the principal components of the variables in \(\bf{X}\). The sample covariance matrix is given by \(S = X^TX/N\), and we have \[X^T X = VD^2V^T,\] which is the eigen decomposition of \(\bf{X}^T\bf{X}\) (and of \(\bf{S}\), up to a factor \(N\)). The eigenvectors \(v_j\) (columns of \(\bf{V}\)) are also called the principal components directions of \(\bf{X}\). The first principal component direction \(v_1\) has the property that \(z_1 = X \cdot v_1\) has the largest sample variance amongst all normalized linear combinations of the columns of \(\bf{X}\).
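Both relationships can be verified directly: the eigenvalues of \(X^TX\) are the squared singular values, and the columns of \(\bf{V}\) agree with the rotation matrix from `prcomp` up to the sign of each column:

```r
set.seed(8)
X = scale(matrix(rnorm(100), nrow = 20, ncol = 5),
          center = TRUE, scale = FALSE)

s = svd(X)
e = eigen(t(X) %*% X)

all.equal(e$values, s$d^2)    # eigenvalues of X^T X are the squared singular values

r = prcomp(X)$rotation        # prcomp is built on this same SVD
max(abs(abs(r) - abs(s$v)))   # essentially zero: same directions up to sign
```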